In this report, we analyze the Forbes Billionaires dataset.
We examine several aspects of the billionaire population using the statistical tools we learned during the course. In particular, we focus on two main questions:
Which parameters may affect the wealth of the billionaire?
Does the variance of wealth vary between different cultures, and why?
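library(tidyverse)  # read_csv, dplyr verbs, tidyr::drop_na, ggplot2, stringr, forcats
library(rworldmap)  # joinCountryData2Map, mapCountryData, getMap
library(viridis)    # mako, scale_color_viridis
library(shape)      # colorlegend
library(plotly)     # plot_ly
Forbes_billionaires <- readr::read_csv("C:/Users/Aharon Malkin/Downloads/forbes_billionaires.csv")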
##
## -- Column specification --------------------------------------------------------
## cols(
## Name = col_character(),
## NetWorth = col_double(),
## Country = col_character(),
## Source = col_character(),
## Rank = col_double(),
## Age = col_double(),
## Residence = col_character(),
## Citizenship = col_character(),
## Status = col_character(),
## Children = col_double(),
## Education = col_character(),
## Self_made = col_logical()
## )
Let’s take a quick look at the data using the head function.
head(Forbes_billionaires)
## # A tibble: 6 x 12
## Name NetWorth Country Source Rank Age Residence Citizenship Status
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Jeff Be~ 177 United ~ Amazon 1 57 Seattle, ~ United Sta~ In Rel~
## 2 Elon Mu~ 151 United ~ Tesla, ~ 2 49 Austin, T~ United Sta~ In Rel~
## 3 Bernard~ 150 France LVMH 3 72 Paris, Fr~ France Married
## 4 Bill Ga~ 124 United ~ Microso~ 4 65 Medina, W~ United Sta~ Divorc~
## 5 Mark Zu~ 97 United ~ Facebook 5 36 Palo Alto~ United Sta~ Married
## 6 Warren ~ 96 United ~ Berkshi~ 6 90 Omaha, Ne~ United Sta~ Widowe~
## # ... with 3 more variables: Children <dbl>, Education <chr>, Self_made <lgl>
The dataset came as a CSV file, a convenient format to work with. It was fairly clean and needed no major filtering.
Our dataset includes the following fields: Name, NetWorth, Country, Source, Rank, Age, Residence, Citizenship, Status, Children, Education, and Self_made.
We will drop the “Source” column, which gives each billionaire’s main source of income as a specific company name. Because these values are not grouped into categories, we do not expect them to lead to meaningful conclusions.
We will also drop the “Residence” column, which is not relevant to our research because we focus on the billionaire’s citizenship.
Forbes_cut<- select(Forbes_billionaires, -Residence, -Source)
head(Forbes_cut)
## # A tibble: 6 x 10
## Name NetWorth Country Rank Age Citizenship Status Children Education
## <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 Jeff B~ 177 United ~ 1 57 United Sta~ In Re~ 4 Bachelor of~
## 2 Elon M~ 151 United ~ 2 49 United Sta~ In Re~ 7 Bachelor of~
## 3 Bernar~ 150 France 3 72 France Marri~ 5 Bachelor of~
## 4 Bill G~ 124 United ~ 4 65 United Sta~ Divor~ 3 Drop Out, H~
## 5 Mark Z~ 97 United ~ 5 36 United Sta~ Marri~ 2 Drop Out, H~
## 6 Warren~ 96 United ~ 6 90 United Sta~ Widow~ 3 Master of S~
## # ... with 1 more variable: Self_made <lgl>
Next, we create a few tables according to our main research questions.
First, let’s look at education, which we assume may have an interesting effect on a billionaire’s wealth. We make a filtered version of our data that excludes the NA values in the “Education” column.
Edu_cut<- Forbes_cut%>% drop_na("Education")
glimpse(Edu_cut)
## Rows: 1,409
## Columns: 10
## $ Name <chr> "Jeff Bezos", "Elon Musk", "Bernard Arnault & family", "Bi~
## $ NetWorth <dbl> 177.0, 151.0, 150.0, 124.0, 97.0, 96.0, 93.0, 91.5, 89.0, ~
## $ Country <chr> "United States", "United States", "France", "United States~
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20,~
## $ Age <dbl> 57, 49, 72, 65, 36, 90, 76, 48, 47, 64, 65, 49, 81, 71, 72~
## $ Citizenship <chr> "United States", "United States", "France", "United States~
## $ Status <chr> "In Relationship", "In Relationship", "Married", "Divorced~
## $ Children <dbl> 4, 7, 5, 3, 2, 3, 4, 1, 3, 3, 3, NA, 6, NA, 4, 3, 2, NA, 4~
## $ Education <chr> "Bachelor of Arts/Science, Princeton University", "Bachelo~
## $ Self_made <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL~
Another interesting parameter is the billionaire’s age: do older billionaires reach higher net worth levels?
Age_cut<- Forbes_cut%>% drop_na("Age")
glimpse(Age_cut)
## Rows: 2,630
## Columns: 10
## $ Name <chr> "Jeff Bezos", "Elon Musk", "Bernard Arnault & family", "Bi~
## $ NetWorth <dbl> 177.0, 151.0, 150.0, 124.0, 97.0, 96.0, 93.0, 91.5, 89.0, ~
## $ Country <chr> "United States", "United States", "France", "United States~
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Age <dbl> 57, 49, 72, 65, 36, 90, 76, 48, 47, 64, 85, 67, 66, 65, 49~
## $ Citizenship <chr> "United States", "United States", "France", "United States~
## $ Status <chr> "In Relationship", "In Relationship", "Married", "Divorced~
## $ Children <dbl> 4, 7, 5, 3, 2, 3, 4, 1, 3, 3, 3, 2, NA, 3, NA, 6, NA, 4, 3~
## $ Education <chr> "Bachelor of Arts/Science, Princeton University", "Bachelo~
## $ Self_made <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL~
One more interesting parameter is the number of children. Does it affect a billionaire’s wealth?
Children_cut<- Forbes_cut%>% drop_na("Children")
glimpse(Children_cut)
## Rows: 1,552
## Columns: 10
## $ Name <chr> "Jeff Bezos", "Elon Musk", "Bernard Arnault & family", "Bi~
## $ NetWorth <dbl> 177.0, 151.0, 150.0, 124.0, 97.0, 96.0, 93.0, 91.5, 89.0, ~
## $ Country <chr> "United States", "United States", "France", "United States~
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 19, 20,~
## $ Age <dbl> 57, 49, 72, 65, 36, 90, 76, 48, 47, 64, 85, 67, 65, 81, 72~
## $ Citizenship <chr> "United States", "United States", "France", "United States~
## $ Status <chr> "In Relationship", "In Relationship", "Married", "Divorced~
## $ Children <dbl> 4, 7, 5, 3, 2, 3, 4, 1, 3, 3, 3, 2, 3, 6, 4, 3, 2, 4, 5, 2~
## $ Education <chr> "Bachelor of Arts/Science, Princeton University", "Bachelo~
## $ Self_made <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL~
Now, for our second question, we filter out the NA values in the “Citizenship” column.
City_cut<- Forbes_cut%>% drop_na("Citizenship")
glimpse(City_cut)
## Rows: 2,739
## Columns: 10
## $ Name <chr> "Jeff Bezos", "Elon Musk", "Bernard Arnault & family", "Bi~
## $ NetWorth <dbl> 177.0, 151.0, 150.0, 124.0, 97.0, 96.0, 93.0, 91.5, 89.0, ~
## $ Country <chr> "United States", "United States", "France", "United States~
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Age <dbl> 57, 49, 72, 65, 36, 90, 76, 48, 47, 64, 85, 67, 66, 65, 49~
## $ Citizenship <chr> "United States", "United States", "France", "United States~
## $ Status <chr> "In Relationship", "In Relationship", "Married", "Divorced~
## $ Children <dbl> 4, 7, 5, 3, 2, 3, 4, 1, 3, 3, 3, 2, NA, 3, NA, 6, NA, 4, 3~
## $ Education <chr> "Bachelor of Arts/Science, Princeton University", "Bachelo~
## $ Self_made <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL~
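Each of the four filtered tables above drops the rows that are missing one particular column, which is why their row counts differ. As a minimal sketch (using a small made-up data frame, not the Forbes data), the number of missing values per column, and hence the number of rows each drop_na call would keep, can be checked in one step with base R:

```r
# Toy data frame with hypothetical values; it only mimics the column names.
toy <- data.frame(
  NetWorth  = c(177, 151, 150, 124),
  Age       = c(57, 49, NA, 65),
  Children  = c(4, NA, NA, 3),
  Education = c("Bachelor", NA, "Bachelor", "Drop Out")
)

# Count the NA values in every column at once.
missing_per_column <- colSums(is.na(toy))

# Rows that would survive drop_na() applied to each single column.
rows_kept <- nrow(toy) - missing_per_column

print(missing_per_column)
print(rows_kept)
```

Running the same two lines on Forbes_cut would show, before any filtering, how much data each of the filtered tables loses.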
counter= City_cut %>% group_by(Country) %>%
count() %>% filter(n > 20)
country1 <- ggplot (data = counter, aes(x = reorder(Country,-n),y=n)) +
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1)) + #adjusting the names in a good angle
geom_col(fill = "turquoise4",color="black") +
labs(x = "Country", y= "Amount", title="Amount Of Billionaires For Each Country")
country1
The map displays the mean net worth by citizenship, from the highest mean (colored blue) to the lowest (colored turquoise).
#setting dataframe for world map
by_NetWorth <- City_cut %>%
group_by(Citizenship) %>%
summarize(NetWorth = mean(NetWorth)) %>%
arrange(desc(NetWorth))
# merging world map data with ours
NetWorth_map <- joinCountryData2Map(by_NetWorth,
joinCode = "NAME",
nameJoinColumn = "Citizenship"
)[-which(getMap()$ADMIN=="Antarctica"),]
## 68 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 175 codes from the map weren't represented in your data
NetWorth_map <- merge(NetWorth_map, by_NetWorth, by = "Citizenship")
# plotting
NetWorth_map_params <- mapCountryData(NetWorth_map,
nameColumnToPlot="NetWorth.y",
mapTitle = "Mean Billionaire NetWorth",
oceanCol = "#DAFDFF",
catMethod = "categorical",
missingCountryCol = "white",
colourPalette = mako(n = 68, begin = 0.2, end = 0.8, direction = -1),
addLegend = FALSE,
border = "black",
lwd = 1)
colorlegend(posx = c(0.05, 0.08),
left = TRUE,
col = mako(n = 68, begin = 0.2, end = 0.8, direction = -1),
zlim = c(1,11),
digit = 1,
zval=c(1,11))
To keep things tidy, we divided our visualizations by topic.
Let’s start with some plots regarding the education of the billionaires.
First, we transformed the free-text “Education” column into a categorical education level.
Education<- Edu_cut$Education
BA<- Education[str_detect(Education, ("Bachelor|LLB")) & !str_detect(Education, "Master|EMBA")]
diploma<-Education[str_detect(Education, ("Diploma|High School|Associate")) & !str_detect(Education, "Bachelor|Master|EMBA|Doctorate|Doctor|Ph.D|Drop Out")]
drop_out<-Education[str_detect(Education, ("Drop Out|drop out")) & !str_detect(Education, "Bachelor")]
master<-Education[str_detect(Education, ("Master|EMBA")) & !str_detect(Education, "Doctorate|Doctor|Ph.D|Drop Out")]
Doctor<-Education[str_detect(Education, ("Doctorate|Doctor|Ph.D"))]
Edu_cut <- Edu_cut %>%
mutate(edu_lvl = ifelse(Education %in% BA,"BA",ifelse(Education %in% diploma,"DIPLOMA/ASSOCIATE",ifelse(Education %in% drop_out,"DROP OUT",ifelse(Education %in% master,"MASTER",ifelse(Education %in% Doctor,"DOCTOR", "UNKNOWN"))))))
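The nested ifelse above can be hard to read. As a hedged alternative sketch, the same kind of classification can be written with dplyr's case_when, which evaluates the conditions top to bottom and returns the first match. The function name classify_education and the rule order (highest degree first) are assumptions made for illustration, so edge cases may be labeled differently than by the original rules; only the first three example strings follow values seen in the data, the rest are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: maps a free-text education string to a level.
# Conditions are checked in order, so the highest degree wins.
classify_education <- function(education) {
  case_when(
    str_detect(education, "Doctor|Ph\\.D")                        ~ "DOCTOR",
    str_detect(education, "Master|EMBA")                          ~ "MASTER",
    str_detect(education, "Bachelor|LLB")                         ~ "BA",
    str_detect(education, regex("drop out", ignore_case = TRUE))  ~ "DROP OUT",
    str_detect(education, "Diploma|High School|Associate")        ~ "DIPLOMA/ASSOCIATE",
    TRUE                                                          ~ "UNKNOWN"
  )
}

examples <- c(
  "Bachelor of Arts/Science, Princeton University",
  "Drop Out, Harvard University",
  "Master of Science, Columbia University",
  "Doctorate, Example University",  # hypothetical value
  "High School",                    # hypothetical value
  "Example University"              # hypothetical value: no degree named
)

edu_lvl <- classify_education(examples)
print(edu_lvl)
```

Inside a pipeline this would be used as mutate(edu_lvl = classify_education(Education)), which avoids building the five intermediate vectors.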
Let’s check that we received the expected values in the “edu_lvl” column.
unique(Edu_cut$edu_lvl)
## [1] "BA" "DROP OUT" "MASTER"
## [4] "DOCTOR" "UNKNOWN" "DIPLOMA/ASSOCIATE"
Some values were categorized as “UNKNOWN”, i.e. values that name the university but not the degree. There are only a few of these, so we filter them out of the following plots.
plot_edu_lvl <- ggplot (data =Edu_cut%>%filter(!edu_lvl=="UNKNOWN"), aes(x = fct_infreq(edu_lvl),fill=edu_lvl)) +
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1)) + #adjusting the names in a good angle
geom_bar(color = "black") +
labs(x = "Education Level", y= "Amount")+scale_fill_brewer(name="Education Level",palette="GnBu")
plot_edu_lvl
Edu_cut %>% filter(!edu_lvl=="UNKNOWN") %>%
group_by(edu_lvl) %>%
summarise(mean_NetWorth = mean(NetWorth))
## # A tibble: 5 x 2
## edu_lvl mean_NetWorth
## * <chr> <dbl>
## 1 BA 6.12
## 2 DIPLOMA/ASSOCIATE 4.29
## 3 DOCTOR 4.06
## 4 DROP OUT 9.8
## 5 MASTER 5.62
Edu_cut %>% filter(!edu_lvl=="UNKNOWN") %>%
group_by(edu_lvl) %>%
summarise(median_NetWorth = median(NetWorth))
## # A tibble: 5 x 2
## edu_lvl median_NetWorth
## * <chr> <dbl>
## 1 BA 2.6
## 2 DIPLOMA/ASSOCIATE 2.7
## 3 DOCTOR 2.9
## 4 DROP OUT 2.9
## 5 MASTER 2.8
Let’s visualize this data with a box plot:
box_edu <- ggplot(Edu_cut %>% filter(!edu_lvl=="UNKNOWN"), aes(x=edu_lvl,y=NetWorth , fill=edu_lvl)) +
geom_boxplot(alpha=0.7) + scale_y_log10() +
stat_summary(fun=mean, geom="point", shape=20, size=7, color="black", fill="black") + # fun.y is deprecated since ggplot2 3.3.0; the mean is computed on the log10 scale
theme(legend.position="none") + labs(x="Education Level",y="Net Worth")+
scale_fill_brewer(name="Education Level",palette="GnBu")
box_edu
We plotted net worth on a log scale to compress extreme values, such as the top three billionaires. Note: the resulting density seems to resemble a chi-square distribution.
ggplot(data=Edu_cut %>% sample_n(500), aes(x=NetWorth))+ scale_x_log10() +
geom_density(fill="turquoise4",alpha=.4)
We chose two representative countries, China and the United States, to compare the billionaires of each; interesting conclusions might come up.
First, let’s filter the data so only billionaires from US and China will be included.
x <-City_cut%>%
filter(Citizenship=="China" | Citizenship== "United States")
unique(x$Citizenship)
## [1] "United States" "China"
Now we check which citizenship has the larger representation in the billionaires list. In the filtered data, which contains only Chinese and American billionaires, there are slightly more Americans than Chinese.
colors = c("lightcoral","lightblue1")
data= x %>%
group_by(Citizenship) %>%
summarize(counts = n(),
percentage = n()/nrow(x))
pie= plot_ly(data = data, labels = ~Citizenship, values = ~percentage, type = 'pie', sort= FALSE,
marker= list(colors=colors, line = list(color="black", width=1)))
pie
Note: the education level is recorded for only about 30% of the Chinese billionaires.
new<- Edu_cut %>% filter(Citizenship=="China")
plot1 <- ggplot (data =new, aes(x = fct_infreq(edu_lvl),fill=edu_lvl)) +
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1)) + #adjusting the names in a good angle
geom_bar(color = "black") +scale_fill_brewer(name="Education level",palette="GnBu")+
labs(x = "Education Level", y= "Amount", title="Amount Of Chinese Billionaires For Each Education Level")
new1<- Edu_cut %>% filter(Citizenship=="United States")
plot2 <- ggplot (data =new1, aes(x = fct_infreq(edu_lvl),fill=edu_lvl)) +
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1)) + #adjusting the names in a good angle
geom_bar(color = "black") + scale_fill_brewer(name="Education Level",palette="GnBu")+
labs(x = "Education Level", y= "Amount", title="Amount Of American Billionaires For Each Education Level")
plot1
plot2
The summary and plots below offer an interesting perspective on the differences between American and Chinese billionaires. There appears to be a distinct disparity between the American and Chinese median, mean and variance of net worth. We will test this later using one of the models.
x %>% group_by(Citizenship) %>%
summarise(mean_NetWorth = mean(NetWorth))
## # A tibble: 2 x 2
## Citizenship mean_NetWorth
## * <chr> <dbl>
## 1 China 4.06
## 2 United States 6.10
box_plot_china_us <- ggplot(x, aes(x=Citizenship,y= NetWorth , fill=Citizenship)) + # x already holds only Chinese and American citizens
geom_boxplot(alpha=0.7) + scale_y_log10()+
stat_summary(fun=mean, geom="point", shape=20, size=7, color="black", fill="black") + # fun.y is deprecated since ggplot2 3.3.0
theme(legend.position="none") +
scale_fill_brewer(palette="GnBu")
box_plot_china_us
dens<-ggplot(x, aes(x=NetWorth, fill=Citizenship))+ scale_x_log10()+ geom_density(alpha=0.4)+scale_fill_brewer(name="Country",palette="GnBu")
dens
After visualizing the variables that we thought could affect one’s net worth, we chose a few of them to test our assumptions.
After visualizing the density of net worth, we suspected that it resembles a chi-square distribution because of the high density on the left side of the plot. To examine this hypothesis, we perform a chi-square goodness-of-fit test, with the null hypothesis that the (log-transformed) net worth of billionaires follows a chi-square distribution.
\[ H_0: \text{Net Worth} \sim \chi^2_{df}\\ H_1: \text{otherwise}\\ \]
First, we look for the number of degrees of freedom whose chi-square distribution is most similar to our net worth distribution, by comparing chi-square quantiles with the quantiles of 10*log10(NetWorth) using geom_qq.
ggplot(Forbes_cut, aes(sample=log10(NetWorth)*10)) + geom_qq(distribution = qchisq , dparams = list(df=5))
We can see that with df=5 the sample quantiles (y axis) are close to the theoretical quantiles (x axis), so df=5 is a reasonable choice.
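Choosing the degrees of freedom by eye from a Q-Q plot can also be done numerically. The sketch below is an illustration on simulated data, not the Forbes data: it draws a sample from a chi-square distribution with 5 degrees of freedom and, for each candidate df, measures the mean squared distance between the sorted sample and the matching theoretical quantiles (the same pairs of points a Q-Q plot draws). The names candidate_df, qq_mse and best_df are made up for this example.

```r
set.seed(42)

# Simulated stand-in for the transformed net worth values.
sample_values <- rchisq(2000, df = 5)

candidate_df <- 1:10
probs <- ppoints(length(sample_values))  # plotting positions used by Q-Q plots

# Mean squared deviation of the Q-Q points from the line y = x, per candidate.
qq_mse <- sapply(candidate_df, function(df) {
  mean((sort(sample_values) - qchisq(probs, df = df))^2)
})

best_df <- candidate_df[which.min(qq_mse)]
print(best_df)
```

Applied to 10*log10(NetWorth), the same loop would give a less subjective choice of df than comparing plots by eye, although a close Q-Q fit still does not guarantee that a formal test accepts the distribution.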
set.seed(0)
log_nw=(log10(Forbes_cut$NetWorth)*10)
nw_breaks <- c(0, 3, 4, 5, 7, 9, 12, 25)
Forbes_cut_gf <- Forbes_cut %>% mutate(nw_bin = cut(log_nw, breaks = nw_breaks, include.lowest=TRUE)) %>%
sample_n(900)
nw_chi_prep <- Forbes_cut_gf %>%
count(nw_bin, name = "observed") %>%
mutate(upper_bound = nw_breaks[-1]) %>%
mutate(lower_bound = nw_breaks[1:7]) %>%
mutate(expected_prob = pchisq(q = upper_bound, df= 5)-
pchisq(q = lower_bound, df=5)) %>%
mutate(expected_prob = expected_prob/sum(expected_prob)) %>%
mutate(expected = expected_prob*1000) %>%
mutate(chi_comp = (observed-expected)^2/expected)
nw_chi_prep
## # A tibble: 7 x 7
## nw_bin observed upper_bound lower_bound expected_prob expected chi_comp
## * <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 [0,3] 350 3 0 0.300 300. 8.31
## 2 (3,4] 133 4 3 0.151 151. 2.05
## 3 (4,5] 95 5 4 0.134 134. 11.1
## 4 (5,7] 126 7 5 0.195 195. 24.6
## 5 (7,9] 95 9 7 0.112 112. 2.47
## 6 (9,12] 60 12 9 0.0743 74.3 2.75
## 7 (12,25] 41 25 12 0.0347 34.7 1.16
chi2_0 <- sum(nw_chi_prep$chi_comp)
chi2_0
## [1] 52.44582
1-pchisq(q = chi2_0, df = 6)
## [1] 1.516912e-09
qchisq(0.95, df = 5)
## [1] 11.0705
chisq.test(x = nw_chi_prep$observed,p = nw_chi_prep$expected_prob)
##
## Chi-squared test for given probabilities
##
## data: nw_chi_prep$observed
## X-squared = 47.162, df = 6, p-value = 1.737e-08
\[ 52.44 = \chi^2_0 > \chi^2_{0.95, 5} = 11.07 \]
The test statistic is far larger than the critical value, and the p-value is far below 0.05, so we reject the null hypothesis. Based on the Q-Q plot, the distribution appeared to resemble a chi-square distribution, but the goodness-of-fit test does not support this. Note that the manual computation scales the expected counts to 1,000 although the sample contains 900 billionaires, which is why it gives 52.45 while chisq.test, which uses the actual sample size, gives 47.16; the conclusion is the same either way. The visual similarity to the chi-square distribution is still interesting, and we will consider researching it in future projects.
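For reference, the goodness-of-fit statistic can be recomputed from the observed bin counts printed in the table above, with the expected counts scaled to the actual sample size. This is a minimal sketch in base R; the variable names are chosen for this example, and the degrees of freedom for the p-value (number of bins minus one) follow the chisq.test output above.

```r
# Observed counts per bin, copied from the nw_chi_prep table.
observed <- c(350, 133, 95, 126, 95, 60, 41)
nw_breaks <- c(0, 3, 4, 5, 7, 9, 12, 25)

# Probability of each bin under a chi-square distribution with df = 5,
# renormalised so the bins cover the whole probability mass.
bin_prob <- diff(pchisq(nw_breaks, df = 5))
bin_prob <- bin_prob / sum(bin_prob)

# Expected counts must add up to the number of sampled billionaires.
expected <- bin_prob * sum(observed)

chi2_stat <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi2_stat, df = length(observed) - 1, lower.tail = FALSE)

# Cross-check against the built-in test.
builtin <- chisq.test(x = observed, p = bin_prob)

print(chi2_stat)
print(p_value)
```

With the expected counts summing to 900, the hand-computed statistic should agree with the value reported by chisq.test.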
As we saw in the plots comparing China and the US, we assume that the variances of their net worth are not equal. To test the hypothesis that the variances are equal, we use the F test. We performed the test despite the fact that the data isn’t normally distributed.
China_N<-c(City_cut %>% filter(Citizenship=="China") %>% sample_n(500) %>% select(NetWorth))
USA_N<-c(City_cut %>% filter(Citizenship=="United States") %>% sample_n(500) %>% select(NetWorth))
Now after we sampled 500 billionaires from each country, let’s do the test.
var.test(x = unlist(China_N),y = unlist(USA_N))
##
## F test to compare two variances
##
## data: unlist(China_N) and unlist(USA_N)
## F = 0.24704, num df = 499, denom df = 499, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2072402 0.2944885
## sample estimates:
## ratio of variances
## 0.2470422
right_value <- qf(0.975, df1 = 499, df2 = 499)
right_value
## [1] 1.192057
left_value <- qf(0.025, df1 = 499, df2 = 499)
left_value
## [1] 0.8388858
\[ f^{(499,499)}_{0.025} = 0.8388858\\ f^{(499,499)}_{0.975} = 1.192057\\ F_0 = 0.2470422\\ F_0 < f^{(499,499)}_{0.025} \]
According to the F test, the F statistic is smaller than the left critical value. Therefore, at the 5% significance level, we reject the null hypothesis and accept the alternative hypothesis: the net worth variances of China and the US aren’t equal.
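The decision rule used above can be written out step by step. The sketch below uses simulated log-normal samples rather than the Forbes data, and the sample names are hypothetical; it computes the ratio of sample variances, compares it with the two critical values, and checks the result against var.test. Keep in mind that the F test assumes normally distributed data, which does not hold for net worth.

```r
set.seed(1)

# Simulated, right-skewed stand-ins for two groups of 500 net worth values.
sample_a <- rlnorm(500, meanlog = 1, sdlog = 0.5)
sample_b <- rlnorm(500, meanlog = 1, sdlog = 1)

# F statistic: ratio of the two sample variances.
f_stat <- var(sample_a) / var(sample_b)

# Two-sided critical values at the 5% significance level.
lower_crit <- qf(0.025, df1 = length(sample_a) - 1, df2 = length(sample_b) - 1)
upper_crit <- qf(0.975, df1 = length(sample_a) - 1, df2 = length(sample_b) - 1)

# Reject equality of variances when the statistic falls outside the interval.
reject_h0 <- f_stat < lower_crit || f_stat > upper_crit

# The built-in test computes the same statistic.
builtin <- var.test(sample_a, sample_b)

print(c(f_stat = f_stat, lower = lower_crit, upper = upper_crit))
print(reject_h0)
```

Because net worth is strongly skewed, a test that is less sensitive to non-normality could serve as a cross-check of the conclusion.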
Looking at our data, we have only one continuous variable that may affect net worth: Age. Although we have heard of a few young billionaires, we hypothesize that life experience has an effect on a billionaire’s wealth. So, we want to find out whether there is a linear relationship between age and (log) net worth.
\[ H_0:\beta_1 = 0\\ H_1:\beta_1 \neq 0\\ \]
fit1 <- lm(formula = log10(NetWorth)~Age , data = Forbes_cut %>% sample_n(1000))
summary(fit1)
##
## Call:
## lm(formula = log10(NetWorth) ~ Age, data = Forbes_cut %>% sample_n(1000))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50628 -0.26995 -0.08005 0.16866 1.80189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3235292 0.0563160 5.745 1.24e-08 ***
## Age 0.0021500 0.0008693 2.473 0.0136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3586 on 953 degrees of freedom
## (45 observations deleted due to missingness)
## Multiple R-squared: 0.006378, Adjusted R-squared: 0.005335
## F-statistic: 6.117 on 1 and 953 DF, p-value: 0.01356
By looking at the regression summary we can see some interesting details.
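1-the beta 1 estimate is very close to 0: because the response is log10(NetWorth), each additional year of age multiplies the expected net worth by about 10^0.00215 ≈ 1.005, i.e. an increase of roughly 0.5%.
2-the R squared is also extremely low (about 0.006).
3-Tb1 & F0 (2.47 and 6.12) are large enough to indicate statistical significance at the 5% level (p ≈ 0.014).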
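Because the model is fitted to log10(NetWorth), the slope acts multiplicatively on net worth rather than adding a fixed dollar amount. A short sketch of that conversion, using the Age estimate from the summary above (the variable names are chosen for this example):

```r
# Slope of Age from the regression summary (response: log10(NetWorth)).
beta1 <- 0.0021500

# One extra year multiplies the expected net worth by 10^beta1.
growth_factor_per_year <- 10^beta1
pct_per_year <- (growth_factor_per_year - 1) * 100

# The effect compounds, so ten years correspond to 10^(10 * beta1).
pct_per_decade <- (10^(10 * beta1) - 1) * 100

print(round(pct_per_year, 2))
print(round(pct_per_decade, 2))
```

So the fitted model implies roughly half a percent more net worth per year of age, or about five percent per decade.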
Together, these results indicate that our linear regression model is statistically significant but the correlation between age and net worth is very weak. With a sample this large, even a very weak relationship can reach significance. Therefore, we conclude that there is almost no correlation between age and net worth. The following scatter plot shows this.
ggplot(Age_cut,aes(x=Age,y=NetWorth))+geom_point(aes(color=NetWorth)) + scale_y_log10()+
scale_color_viridis(option = "D")
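A significant slope together with a tiny R squared may look contradictory, but it is a common outcome with large samples. The sketch below illustrates this on simulated data only; the sample size, slope and noise level are arbitrary assumptions, not estimates from the Forbes data.

```r
set.seed(123)

n <- 5000
x <- rnorm(n)
# Weak true relationship buried in much larger noise.
y <- 0.1 * x + rnorm(n)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

slope_p_value <- coef(fit_summary)["x", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

print(slope_p_value)  # small: the slope is distinguishable from zero
print(r_squared)      # also small: x explains very little of y
```

The p-value answers whether the slope differs from zero, while R squared answers how much of the variation the predictor explains, so the two can disagree in exactly the way seen in our regression.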
In this project, we examined two different questions. Using statistical measures, we found that the relation between a billionaire’s age and total net worth is statistically significant at the 5% level but extremely weak (R squared of about 0.006), so age explains almost none of the variation in net worth.
The variance test indicated that the variances of net worth differ between Chinese and American billionaires, which made us wonder why this difference exists. To find out, we took a step further and tried to develop a theory. Based on what we learned in “Micro Organizational Behavior”, a course we took last semester, we assumed that the difference is an outcome of cultural difference: American culture is individualistic, everyone for himself, while Chinese culture is collectivist, “one for all and all for one”.
To get a second opinion on our assumption, we emailed Carmit Tadmor, our lecturer for Micro Organizational Behavior, who specializes in cultural differences and their effect on one’s life. Carmit explained that it is difficult to draw significant conclusions from the data: we did not take into account any control factors (like industry, poverty, etc.), and using citizenship alone is problematic because it is unknown where the person actually made his fortune. In her opinion, the gap between the two variances may be more related to power distance than to cultural differences. While the US is relatively egalitarian and anyone can succeed, China has a very hierarchical system in which only those who receive approval from the government and are close to those in power can succeed.
You can read more about it in this linked article.
We are aware that some of our assumptions are problematic because of missing data. Moreover, billionaires are a very small group, so every billionaire we had to filter out due to missing data had a noticeable effect on our results and may have biased some of our tests.
As for the goodness-of-fit test, the results were really interesting: the density looked impressively similar to a chi-square distribution, even though the test rejected that fit. As the density visualization showed, most billionaires have a net worth of only a few billion dollars, so the data is clearly not normally distributed. To continue our project and run further tests, we nevertheless assumed normality, relying on the central limit theorem.
Finally, the project made us curious and increased our knowledge and experience in R and statistics. We learned a lot while working together and hope to expand our knowledge in the future.